Non-Strict Cache Coherence: Exploiting Data-Race Tolerance in Emerging Applications
Authors
Abstract
Software distributed shared memory (DSM) platforms on networks of workstations tolerate large network latencies by employing one of several weak memory consistency models. Data-race tolerant applications, such as Genetic Algorithms (GAs) and Probabilistic Inference, offer an additional degree of freedom to tolerate network latency: they do not synchronize shared memory references, and behave correctly when supplied outdated shared data. However, these algorithms often have a high communication-to-computation ratio and can flood the network with messages in the presence of large message delays. We explore the benefits of designing a DSM with non-strict cache coherence for such applications. We study the performance of controlled asynchronous implementations of these algorithms via the use of a previously proposed blocking Global Read memory access primitive. Global Read implements non-strict cache coherence by guaranteeing to return to the reader a shared datum value from within a specified staleness range; synchronization primitives are thereby avoided. As compared to fully asynchronous implementations, controlled (i.e. partial) asynchrony, implemented using Global Read, reduces the overall amount of computation done with stale data by a process, thus controlling the amount of shared updates (and thereby the network traffic) generated. Experiments on an IBM SP2 multicomputer with an Ethernet interconnect show significant performance improvements for controlled asynchronous implementations. On a lightly loaded network, most of the GA benchmarks see 30% to 40% improvement over the best competitor across configurations ranging from 2 to 16 processors, while two of the Probabilistic Inference benchmarks see more than 80% improvement on a 2-node configuration. As the network load increases, the benefits of non-strict coherence and partial asynchrony increase significantly.
Overall, the results indicate that non-strict cache coherence is significantly beneficial compared with both weak consistency memory models based on data-race freedom and fully asynchronous models that offer no coherence guarantees. [A shorter version of this document was published in the 29th International Conference on Parallel Processing (ICPP-2000), held in Toronto, Canada, August 21-24, 2000.]
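To make the idea of a blocking read with a bounded staleness range concrete, here is a minimal single-machine sketch in Python. It is not the paper's DSM implementation: the class name `StaleBoundedCell`, the methods `apply_update` and `global_read`, and the use of an iteration counter as the staleness measure are all assumptions made for illustration. The abstract does not define how staleness is measured.

```python
import threading


class StaleBoundedCell:
    """Hypothetical local cached copy of one shared datum.

    Each value is tagged with the iteration number of the writer that
    produced it (an assumed staleness measure, not taken from the paper).
    """

    def __init__(self, value, version=0):
        self._value = value
        self._version = version
        self._cond = threading.Condition()

    def apply_update(self, value, version):
        """Install an update that arrived from the writer."""
        with self._cond:
            # Ignore updates that are older than what is already cached.
            if version > self._version:
                self._value = value
                self._version = version
                self._cond.notify_all()

    def global_read(self, reader_iter, max_staleness, timeout=None):
        """Return (value, version) once the cached copy is at most
        max_staleness iterations behind reader_iter; block until then."""
        needed = reader_iter - max_staleness
        with self._cond:
            fresh_enough = self._cond.wait_for(
                lambda: self._version >= needed, timeout
            )
            if not fresh_enough:
                raise TimeoutError("no sufficiently fresh value arrived")
            return self._value, self._version
```

In this sketch a reader at iteration 7 with `max_staleness=2` proceeds immediately if its cached copy carries version 5 or later, and otherwise waits for an update. A reader far ahead of the writer is therefore throttled without any explicit lock or barrier, which is the controlled-asynchrony effect the abstract describes.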
Similar resources
Exploiting the Benefits of Multiple-Path Network DSM Systems: Architectural Alternatives and Performance Evaluation
Modern high performance networks being used for scalable distributed shared memory (DSM) systems support multiple paths to increase bandwidth and/or reduce contention. Such networks violate the constraint of pairwise in-order message delivery implicitly required by many existing directory-based cache coherence protocols. To solve this problem, two alternative strategies are currently used by ...
Exploiting Data Locality in Adaptive Architectures
The speed of processors increases much faster than the memory access time. This makes memory accesses expensive. To meet this problem, cache hierarchies are introduced to serve the processor with data. However, the effectiveness of caches depends on the amount of locality in the application’s memory access pattern. The behavior of various programs differs greatly in terms of cache miss characte...
Soft Coherence: Preliminary Experiments with Error-Tolerant Cache Coherence in Numerical Applications
As we scale into the multi-core era, we face severe challenges in the scalability and performance of on-chip cache-coherent shared memory mechanisms. We explore application error-tolerance as an extra degree of freedom to meet these challenges. Iterative numerical algorithms, in particular, can cope with the occasional stale value with little or no effect on accuracy or convergence time. We exp...
TachoRace: Exploiting Performance Counters for Run-Time Race Detection
Fixing data races is a difficult parallel programming problem, even for experienced programmers. At the moment, dynamic race detectors are frequently used because they find races more reliably than other approaches; however, the dynamic approach significantly influences application behavior during debugging because all threads' memory accesses need to be monitored. Despite using such detectors ...
Predicting Data Cache Misses in Non-Numeric
To maximize the benefit and minimize the overhead of software-based latency tolerance techniques, we would like to apply them precisely to the set of dynamic references that suffer cache misses. Unfortunately, the information provided by the state-of-the-art cache miss profiling technique (summary profiling) is inadequate for references with intermediate miss ratios; it results in either failing to...